Project 4: Explore and Summarize Data

Red Wine Quality by Laima Stinskaite

General information about dataset

I chose the “Red Wine Quality” dataset because I am a keen red wine enthusiast and want to know more about these wines. Let’s take a quick look at the structure and summary of this dataset:

## [1] 1599   13
##  [1] "X"                    "fixed.acidity"        "volatile.acidity"    
##  [4] "citric.acid"          "residual.sugar"       "chlorides"           
##  [7] "free.sulfur.dioxide"  "total.sulfur.dioxide" "density"             
## [10] "pH"                   "sulphates"            "alcohol"             
## [13] "quality"
## 'data.frame':    1599 obs. of  13 variables:
##  $ X                   : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##        X          fixed.acidity   volatile.acidity  citric.acid   
##  Min.   :   1.0   Min.   : 4.60   Min.   :0.1200   Min.   :0.000  
##  1st Qu.: 400.5   1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090  
##  Median : 800.0   Median : 7.90   Median :0.5200   Median :0.260  
##  Mean   : 800.0   Mean   : 8.32   Mean   :0.5278   Mean   :0.271  
##  3rd Qu.:1199.5   3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420  
##  Max.   :1599.0   Max.   :15.90   Max.   :1.5800   Max.   :1.000  
##  residual.sugar     chlorides       free.sulfur.dioxide
##  Min.   : 0.900   Min.   :0.01200   Min.   : 1.00      
##  1st Qu.: 1.900   1st Qu.:0.07000   1st Qu.: 7.00      
##  Median : 2.200   Median :0.07900   Median :14.00      
##  Mean   : 2.539   Mean   :0.08747   Mean   :15.87      
##  3rd Qu.: 2.600   3rd Qu.:0.09000   3rd Qu.:21.00      
##  Max.   :15.500   Max.   :0.61100   Max.   :72.00      
##  total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  6.00       Min.   :0.9901   Min.   :2.740   Min.   :0.3300  
##  1st Qu.: 22.00       1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500  
##  Median : 38.00       Median :0.9968   Median :3.310   Median :0.6200  
##  Mean   : 46.47       Mean   :0.9967   Mean   :3.311   Mean   :0.6581  
##  3rd Qu.: 62.00       3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300  
##  Max.   :289.00       Max.   :1.0037   Max.   :4.010   Max.   :2.0000  
##     alcohol         quality     
##  Min.   : 8.40   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.20   Median :6.000  
##  Mean   :10.42   Mean   :5.636  
##  3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :14.90   Max.   :8.000
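The output above has the shape produced by `dim()`, `names()`, `str()` and `summary()`. The report does not show how the data was loaded, so the sketch below is only an illustration: it rebuilds the first ten rows of four columns from the `str()` listing and runs the same calls on that small sample. The name `rwq_sample` and the file name in the comment are assumptions.

```r
# With the full data one would presumably start with something like:
# rwq <- read.csv("wineQualityReds.csv")   # hypothetical file name

# First ten rows of four columns, copied from the str() output above.
rwq_sample <- data.frame(
  fixed.acidity    = c(7.4, 7.8, 7.8, 11.2, 7.4, 7.4, 7.9, 7.3, 7.8, 7.5),
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.70, 0.66, 0.60, 0.65, 0.58, 0.50),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5L, 5L, 5L, 6L, 5L, 5L, 5L, 7L, 7L, 5L)
)

dim(rwq_sample)      # number of rows and columns
names(rwq_sample)    # column names
str(rwq_sample)      # type and first values of each column
summary(rwq_sample)  # min, quartiles, mean and max of each column
```

On the full dataset the same four calls would give the 1599 x 13 overview shown above.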

Univariate Plots Section

Let’s create histograms for every variable in our dataset and output summary data for them:

fixed acidity: most acids involved with wine are fixed or nonvolatile (do not evaporate readily)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    4.60    7.10    7.90    8.32    9.20   15.90

The fixed acidity distribution looks roughly normal, with a slight right skew (the mean, 8.32, sits above the median, 7.90). The most common value is about 7 g/dm^3. Typical values are below 14 g/dm^3, with a maximum of 15.9 g/dm^3.

volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.1200  0.3900  0.5200  0.5278  0.6400  1.5800

The volatile acidity distribution also looks roughly normal. The most common value is about 0.6 g/dm^3. Typical values are below 1 g/dm^3, with a maximum of 1.58 g/dm^3.

citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   0.090   0.260   0.271   0.420   1.000

The citric.acid distribution has three peaks: at 0 g/dm^3, at about 0.25 g/dm^3 and at about 0.5 g/dm^3. The most common value is around 0 g/dm^3. Typical values are below 0.75 g/dm^3, with a maximum of 1 g/dm^3. The value citric.acid = 1 g/dm^3 looks like an outlier, and the boxplot confirms it.
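The boxplot check can be reproduced from the summary statistics alone. The sketch below assumes the boxplot uses the usual 1.5 x IQR whisker rule (the default in both base R and ggplot2) and takes the quartiles and maximum of citric.acid from the summary printed above; the variable names are illustrative.

```r
# Quartiles and maximum of citric.acid, copied from the summary above.
q1        <- 0.090
q3        <- 0.420
max_value <- 1.000

# Tukey's rule: points beyond Q3 + 1.5 * IQR are drawn as outliers.
iqr         <- q3 - q1
upper_fence <- q3 + 1.5 * iqr

upper_fence              # upper whisker limit
max_value > upper_fence  # TRUE means the maximum is flagged as an outlier
```

The same calculation applies to any other variable by swapping in its quartiles.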

residual sugar: the amount of sugar remaining after fermentation stops; it’s rare to find wines with less than 1 gram/liter, and wines with more than 45 grams/liter are considered sweet

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.900   1.900   2.200   2.539   2.600  15.500

The residual sugar distribution is right-skewed, with a long tail that extends beyond 14 g/dm^3. We applied a log10 scale to the x axis to see the distribution more clearly. The most common values are around 2 g/dm^3: the median is 2.2 g/dm^3 and half of the wines fall between 1.9 and 2.6 g/dm^3. Almost all values are below 10 g/dm^3, with a maximum of 15.5 g/dm^3.
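To illustrate why a log10 axis helps with a long right tail, here is a minimal sketch on synthetic data. The values are log-normal quantiles chosen to resemble residual sugar, not the wine measurements, and the plotting part runs only if ggplot2 is installed.

```r
# Synthetic right-skewed values (log-normal quantiles), NOT the wine data.
x <- qlnorm(ppoints(1599), meanlog = log(2.2), sdlog = 0.35)

# On the raw scale the long right tail pulls the mean above the median.
raw_gap <- mean(x) - median(x)

# On the log10 scale these values are symmetric, so the gap disappears.
log_gap <- mean(log10(x)) - median(log10(x))

c(raw_gap = raw_gap, log_gap = log_gap)

# Optional: the same idea as a histogram with a log10 x axis.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  p <- ggplot2::ggplot(data.frame(x = x), ggplot2::aes(x = x)) +
    ggplot2::geom_histogram(bins = 30) +
    ggplot2::scale_x_log10()
  # print(p) draws the plot
}
```

A gap between mean and median that shrinks after the transform is a quick sign that the log scale gives a more symmetric picture.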

chlorides: the amount of salt in the wine

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.07000 0.07900 0.08747 0.09000 0.61100

The chlorides distribution is right-skewed, with a long tail that extends beyond 0.2 g/dm^3. We applied a log10 scale to the x axis to see the distribution more clearly. The most common values are around 0.08-0.09 g/dm^3. Typical values are below 0.15 g/dm^3, with a maximum of 0.611 g/dm^3. Values above roughly 0.1 g/dm^3 look like outliers, and the boxplot confirms it.

free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   14.00   15.87   21.00   72.00

The free sulfur dioxide distribution is right-skewed, with a long tail that extends beyond 40 mg/dm^3. We applied a log10 scale to the x axis to see the distribution more clearly. The most common values are around 7-8 and 10 mg/dm^3. Typical values are below 40 mg/dm^3, with a maximum of 72 mg/dm^3. Values above roughly 40 mg/dm^3 look like outliers, and the boxplot confirms it.

total sulfur dioxide: amount of free and bound forms of SO2; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    6.00   22.00   38.00   46.47   62.00  289.00

The total sulfur dioxide distribution is right-skewed, with a long tail that extends beyond 150 mg/dm^3. We applied a log10 scale to the x axis to see the distribution of total sulfur dioxide more clearly. The most common value is around 60 mg/dm^3. Typical values are below 120 mg/dm^3, with a maximum of 289 mg/dm^3. Values above roughly 120 mg/dm^3 look like outliers, and the boxplot confirms it.

density: the density of wine is close to that of water, depending on the percent alcohol and sugar content

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9901  0.9956  0.9968  0.9967  0.9978  1.0040

The density distribution looks normal. Most values lie between 0.995 and 1 g/cm^3. Typical values are below 1.0025 g/cm^3, with a maximum of 1.004 g/cm^3.

pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.740   3.210   3.310   3.311   3.400   4.010

The pH distribution looks normal. Most values lie between 3.25 and 3.5. Typical values are below 3.75, with a maximum of 4.01. All wines are acidic (pH is less than 7).

sulphates: a wine additive which can contribute to sulfur dioxide gas (SO2) levels, which acts as an antimicrobial and antioxidant

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.3300  0.5500  0.6200  0.6581  0.7300  2.0000

The sulphates distribution is right-skewed, with a long tail that extends beyond 1 g/dm^3. We applied a log10 scale to the x axis to see the distribution of sulphates more clearly. Most values lie between 0.5 and 0.75 g/dm^3. Typical values are below 1.5 g/dm^3, with a maximum of 2 g/dm^3. Values above roughly 1 g/dm^3 look like outliers, and the boxplot confirms it.

alcohol: the percent alcohol content of the wine

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.40    9.50   10.20   10.42   11.10   14.90

The alcohol distribution is right-skewed, with a tail that extends beyond 13%. We applied a log10 scale to the x axis to see the distribution of alcohol more clearly. Most wines have an alcohol content between 9% and 13%, with a maximum of 14.9%. I wonder how alcohol is connected to quality.

Quality is a score between 0 (very bad) and 10 (excellent). It is the output variable in our dataset, so we are interested in its relationships with all the other variables.

Quality takes only a few integer values, so it behaves like a categorical variable. It is therefore better to convert it to a factor:

##   3   4   5   6   7   8 
##  10  53 681 638 199  18

Now we see that quality has only 6 levels. Let’s group them into ‘low’ (3-4), ‘medium’ (5-6) and ‘high’ (7-8) to get a clearer picture of the quality distribution for red wines:

## Warning: Ignoring unknown parameters: binwidth, bins, pad

##    low medium   high 
##     63   1319    217
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.000   5.000   6.000   5.636   6.000   8.000

As we see from the plot above, the most common quality category for red wines is medium (1319 wines, compared with 63 of low quality and 217 of high quality). The summary statistics put the 1st quartile at 5 and the 3rd quartile at 6, and about 82% of the observations (1319 of 1599) have a medium score of 5 or 6. Few red wines are of low or high quality. Some quality levels are also missing from our dataset: 0, 1, 2, 9 and 10. Maybe that is because the data covers only one kind of wine, Portuguese “Vinho Verde”?
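The grouping of quality scores can be sketched as follows. The quality vector is rebuilt from the counts printed above rather than read from the data file, and the names `quality_factor` and `quality_group` are illustrative, not necessarily the ones used in the original analysis.

```r
# Rebuild the quality scores from the counts per level shown above.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Treat quality as an ordered categorical variable.
quality_factor <- factor(quality, ordered = TRUE)

# Bin the scores: (2,4] -> low, (4,6] -> medium, (6,8] -> high.
quality_group <- cut(quality,
                     breaks = c(2, 4, 6, 8),
                     labels = c("low", "medium", "high"))

table(quality_group)                        # counts per group
round(prop.table(table(quality_group)), 3)  # share of each group
round(mean(quality), 3)                     # overall mean score
```

Because `cut()` uses right-closed intervals by default, scores 3-4, 5-6 and 7-8 land in the low, medium and high groups respectively.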

Univariate Analysis

What is the structure of your dataset?

Our dataset consists of 1599 observations of 13 variables. All variables are numeric. There is one output variable, quality. The first variable (X) is a unique id for each observation.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in our dataset is quality of wine.

What other features in the dataset do you think will help support your
investigation into your feature(s) of interest?

There are 11 other features that could affect wine quality. I expect pH, alcohol and sulphates to have the most influence. We will check this later.

Did you create any new variables from existing variables in the dataset?

Yes: quality_level and quality_label (they are described above).

Of the features you investigated, were there any unusual distributions?
Did you perform any operations on the data to tidy, adjust, or change the form
of the data? If so, why did you do this?

Several distributions have long tails. I used scale_x_log10 to look at them in more detail, and for some plots I adjusted binwidth a little. This helped me to understand the distributions of the variables better.

Bivariate Plots Section

We start by looking for correlations between pairs of variables in our dataset, to find which features are associated with bad or good wine quality. We suggested above that pH, alcohol and sulphates may influence wine quality, but we have no evidence for that yet, so we need to create some plots and run correlation tests for every potentially correlated pair.

For this purpose we first use a ggpairs plot and a correlation test. A ggpairs plot is useful in bivariate analysis because it quickly gives us clues for further analysis. It combines correlation coefficients with different kinds of plots (histograms, scatterplots and line charts) for every pair of variables in our dataset.

##                      fixed.acidity volatile.acidity citric.acid
## fixed.acidity           1.00000000     -0.256130895  0.67170343
## volatile.acidity       -0.25613089      1.000000000 -0.55249568
## citric.acid             0.67170343     -0.552495685  1.00000000
## residual.sugar          0.11477672      0.001917882  0.14357716
## chlorides               0.09370519      0.061297772  0.20382291
## free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
## total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
## density                 0.66804729      0.022026232  0.36494718
## pH                     -0.68297819      0.234937294 -0.54190414
## sulphates               0.18300566     -0.260986685  0.31277004
## alcohol                -0.06166827     -0.202288027  0.10990325
## quality                 0.12405165     -0.390557780  0.22637251
##                      residual.sugar    chlorides free.sulfur.dioxide
## fixed.acidity           0.114776724  0.093705186        -0.153794193
## volatile.acidity        0.001917882  0.061297772        -0.010503827
## citric.acid             0.143577162  0.203822914        -0.060978129
## residual.sugar          1.000000000  0.055609535         0.187048995
## chlorides               0.055609535  1.000000000         0.005562147
## free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
## total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
## density                 0.355283371  0.200632327        -0.021945831
## pH                     -0.085652422 -0.265026131         0.070377499
## sulphates               0.005527121  0.371260481         0.051657572
## alcohol                 0.042075437 -0.221140545        -0.069408354
## quality                 0.013731637 -0.128906560        -0.050656057
##                      total.sulfur.dioxide     density          pH
## fixed.acidity                 -0.11318144  0.66804729 -0.68297819
## volatile.acidity               0.07647000  0.02202623  0.23493729
## citric.acid                    0.03553302  0.36494718 -0.54190414
## residual.sugar                 0.20302788  0.35528337 -0.08565242
## chlorides                      0.04740047  0.20063233 -0.26502613
## free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
## total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
## density                        0.07126948  1.00000000 -0.34169933
## pH                            -0.06649456 -0.34169933  1.00000000
## sulphates                      0.04294684  0.14850641 -0.19664760
## alcohol                       -0.20565394 -0.49617977  0.20563251
## quality                       -0.18510029 -0.17491923 -0.05773139
##                         sulphates     alcohol     quality
## fixed.acidity         0.183005664 -0.06166827  0.12405165
## volatile.acidity     -0.260986685 -0.20228803 -0.39055778
## citric.acid           0.312770044  0.10990325  0.22637251
## residual.sugar        0.005527121  0.04207544  0.01373164
## chlorides             0.371260481 -0.22114054 -0.12890656
## free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
## total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
## density               0.148506412 -0.49617977 -0.17491923
## pH                   -0.196647602  0.20563251 -0.05773139
## sulphates             1.000000000  0.09359475  0.25139708
## alcohol               0.093594750  1.00000000  0.47616632
## quality               0.251397079  0.47616632  1.00000000

As we see:

  1. quality has a moderate positive correlation with alcohol (~0.48)

  2. citric.acid and fixed.acidity are correlated (~0.67)

  3. fixed.acidity has strong correlation with density (~0.67)

  4. fixed.acidity has negative correlation with pH (~-0.68)

  5. free.sulfur.dioxide is correlated with total.sulfur.dioxide (~0.67)

  6. volatile.acidity has negative correlation with citric.acid (~-0.55)

So we were right about the correlation between wine quality and alcohol level, but wrong about pH and sulphates: quality is correlated with them, but the coefficients are low (about -0.06 for pH and 0.25 for sulphates).
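Picking the strong pairs out of a correlation matrix by eye is error-prone, so the filtering step can be sketched in code. The example below uses only a 4 x 4 excerpt of the matrix above (values rounded to four decimals) and a cut-off of |r| > 0.5; the threshold and the names `r`, `idx` and `strong` are assumptions made for illustration.

```r
# Excerpt of the correlation matrix above, rounded to four decimals.
vars <- c("fixed.acidity", "citric.acid", "density", "pH")
r <- matrix(c(
   1.0000,  0.6717,  0.6680, -0.6830,
   0.6717,  1.0000,  0.3649, -0.5419,
   0.6680,  0.3649,  1.0000, -0.3417,
  -0.6830, -0.5419, -0.3417,  1.0000
), nrow = 4, byrow = TRUE, dimnames = list(vars, vars))

# Keep each pair once (upper triangle) and only if |r| exceeds 0.5.
idx <- which(upper.tri(r) & abs(r) > 0.5, arr.ind = TRUE)

strong <- data.frame(
  var1 = vars[idx[, "row"]],
  var2 = vars[idx[, "col"]],
  r    = r[idx]
)

# Strongest relationships first.
strong <- strong[order(-abs(strong$r)), ]
strong
```

With the full data, `r` would come from calling `cor()` on the numeric columns instead of being typed in by hand.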

Now let’s take a closer look at the 6 correlations that we found in the ggpairs plot. For this purpose we will use boxplots and scatterplots with a regression line to learn more about the correlation and distribution of each of the 6 feature pairs.

## [1] 0.4761663

From this plot we see that wine quality generally increases as % of alcohol increases. A correlation coefficient of 0.48 supports this belief.

## [1] 0.6717034

We see that fixed acidity generally increases as citric acid increases. The correlation coefficient between these features, 0.67, supports this. We also see many points where citric.acid = 0 while fixed.acidity is not 0. This means that some wines contain none of this acid, which can add ‘freshness’ and flavor to wines.

## [1] 0.6680473

From this plot we see that fixed.acidity generally increases as density increases. A correlation coefficient of 0.67 supports this belief. Also we see that the most popular values of these chemicals are:

and

## [1] -0.6829782

We see that fixed acidity generally decreases as pH increases. The correlation coefficient between these features, -0.68, supports this. Also there are a lot of points in the following area:

and

This means that most wines are acidic.

## [1] 0.6676665

From this plot we see that total.sulfur.dioxide generally increases as free.sulfur.dioxide increases. A correlation coefficient of 0.67 supports this. Most values of free.sulfur.dioxide are between 0 and 20 mg/dm^3, and most values of total.sulfur.dioxide are between 0 and 50 mg/dm^3. This suggests that in many of the wines SO2 is not noticeable in the smell or taste.

## [1] -0.5524957

We see that volatile acidity generally decreases as citric acid increases. The correlation coefficient between these features, -0.55, supports this. The most common values of citric acid are 0 and 0.5 g/dm^3, and most volatile acidity values lie between 0.2 and 0.8 g/dm^3. This suggests that many wines in the dataset don’t have an unpleasant, vinegar taste.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. How did the feature(s) of interest vary with other features in
the dataset?

As we saw above, quality is moderately correlated with alcohol (wine quality tends to be higher when the alcohol % is higher) and only weakly correlated with the other 10 variables; the next strongest coefficient is about -0.39, for volatile.acidity.

Did you observe any interesting relationships between the other features
(not the main feature(s) of interest)?

Yes, some of the other variables are more strongly correlated with each other than any of them is with quality, our feature of interest.

We found strong positive correlations between:

  • citric.acid and fixed.acidity

  • fixed.acidity and density

  • free.sulfur.dioxide and total.sulfur.dioxide

and strong negative correlations between:

  • fixed.acidity and pH

  • volatile.acidity and citric.acid

What was the strongest relationship you found?

For our feature of interest, the strongest relationship is with alcohol. Among the other features, the strongest correlation is between fixed.acidity and pH.

Multivariate Plots Section

After the bivariate analysis, let’s look at how quality relates to the pairs of features that are correlated with each other. This multivariate analysis is important because it can tell us more about how combinations of features behave; without it we could miss important relationships in the data.

So let’s see how quality relates to each pair of mutually correlated features:

It seems that the correlation between fixed acidity and citric acid doesn’t affect the quality of the wine.

The plot above suggests that wines with both higher fixed acidity and higher density tend to have higher quality.

In the plot with free.sulfur.dioxide, total.sulfur.dioxide and quality, low quality wines contain less than ~40 mg/dm^3 of free.sulfur.dioxide and high quality wines less than ~55 mg/dm^3. Only medium quality wines have free.sulfur.dioxide > 55 mg/dm^3. “free.sulfur.dioxide prevents microbial growth and the oxidation of wine”, so a higher value may go with better quality up to a point, but above roughly 55 mg/dm^3 the quality seems to get worse.

It seems that the correlation between fixed acidity and pH doesn’t affect the quality of the wine.

In the plot with citric.acid, volatile.acidity and quality, low quality wines have higher volatile.acidity and lower citric.acid values. This is consistent with the description of volatile acidity as “the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste”.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the
investigation. Were there features that strengthened each other in terms of
looking at your feature(s) of interest?

Apart from alcohol, I don’t see features or correlations between features that have a big impact on quality, but volatile.acidity combined with citric.acid does appear to have an effect on wine quality.

Were there any interesting or surprising interactions between features?

Yes, in the plot with free.sulfur.dioxide, total.sulfur.dioxide and quality we see 2 points at total.sulfur.dioxide ~ 255 mg/dm^3. These points appear to be outliers.

OPTIONAL: Did you create any models with your dataset? Discuss the
strengths and limitations of your model.

## 
## Calls:
## m1: lm(formula = quality ~ alcohol, data = rwq)
## m2: lm(formula = quality ~ alcohol + fixed.acidity, data = rwq)
## m3: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity, 
##     data = rwq)
## m4: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid, data = rwq)
## m5: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar, data = rwq)
## m6: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides, data = rwq)
## m7: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide, 
##     data = rwq)
## m8: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide, data = rwq)
## m9: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density, data = rwq)
## m10: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH, data = rwq)
## m11: lm(formula = quality ~ alcohol + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + density + pH + sulphates, data = rwq)
## 
## ====================================================================================================================================================
##                            m1         m2         m3         m4         m5         m6         m7         m8          m9         m10         m11      
## ----------------------------------------------------------------------------------------------------------------------------------------------------
##   (Intercept)            1.875***   1.206***   2.674***   2.622***   2.626***   2.663***   2.691***   2.890***   15.660     -19.976      21.965     
##                         (0.175)    (0.196)    (0.218)    (0.219)    (0.219)    (0.229)    (0.237)    (0.243)    (17.684)    (20.943)    (21.195)    
##   alcohol                0.361***   0.368***   0.321***   0.325***   0.325***   0.322***   0.322***   0.306***    0.296***    0.337***    0.276***  
##                         (0.017)    (0.016)    (0.016)    (0.016)    (0.016)    (0.017)    (0.017)    (0.017)     (0.022)     (0.026)     (0.026)    
##   fixed.acidity                     0.071***   0.036***   0.056***   0.056***   0.055***   0.054***   0.043**     0.052**    -0.007       0.025     
##                                    (0.010)    (0.010)    (0.013)    (0.013)    (0.013)    (0.014)    (0.014)     (0.018)     (0.026)     (0.026)    
##   volatile.acidity                            -1.286***  -1.420***  -1.416***  -1.403***  -1.406***  -1.320***   -1.308***   -1.287***   -1.084***  
##                                               (0.099)    (0.115)    (0.115)    (0.117)    (0.118)    (0.120)     (0.121)     (0.121)     (0.121)    
##   citric.acid                                            -0.314*    -0.308*    -0.283     -0.282     -0.133      -0.133      -0.151      -0.183     
##                                                          (0.137)    (0.138)    (0.145)    (0.145)    (0.150)     (0.150)     (0.150)     (0.147)    
##   residual.sugar                                                    -0.004     -0.004     -0.003      0.002       0.007      -0.008       0.016     
##                                                                     (0.012)    (0.012)    (0.012)    (0.012)     (0.014)     (0.015)     (0.015)    
##   chlorides                                                                    -0.215     -0.217     -0.332      -0.321      -0.696      -1.874***  
##                                                                                (0.383)    (0.383)    (0.383)     (0.383)     (0.400)     (0.419)    
##   free.sulfur.dioxide                                                                     -0.001      0.004*      0.004*      0.005*      0.004*    
##                                                                                           (0.002)    (0.002)     (0.002)     (0.002)     (0.002)    
##   total.sulfur.dioxide                                                                               -0.003***   -0.003***   -0.003***   -0.003***  
##                                                                                                      (0.001)     (0.001)     (0.001)     (0.001)    
##   density                                                                                                       -12.797      25.114     -17.881     
##                                                                                                                 (17.720)    (21.370)    (21.633)    
##   pH                                                                                                                         -0.611**    -0.414*    
##                                                                                                                              (0.194)     (0.192)    
##   sulphates                                                                                                                               0.916***  
##                                                                                                                                          (0.114)    
## ----------------------------------------------------------------------------------------------------------------------------------------------------
##   R-squared                  0.2        0.3        0.3        0.3        0.3        0.3        0.3        0.3        0.3         0.3         0.4    
##   adj. R-squared             0.2        0.2        0.3        0.3        0.3        0.3        0.3        0.3        0.3         0.3         0.4    
##   sigma                      0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.7        0.7         0.7         0.6    
##   F                        468.3      266.5      253.1      191.6      153.2      127.7      109.4       98.0       87.2        79.9        81.3    
##   p                          0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0        0.0         0.0         0.0    
##   Log-likelihood         -1721.1    -1696.2    -1615.4    -1612.7    -1612.7    -1612.5    -1612.4    -1606.1    -1605.9     -1600.9     -1569.1    
##   Deviance                 805.9      781.2      706.1      703.8      703.7      703.6      703.5      698.0      697.7       693.4       666.4    
##   AIC                     3448.1     3400.5     3240.8     3237.5     3239.4     3241.1     3242.8     3232.2     3233.7      3225.7      3164.3    
##   BIC                     3464.2     3422.0     3267.6     3269.7     3277.0     3284.1     3291.2     3286.0     3292.9      3290.2      3234.2    
##   N                       1599       1599       1599       1599       1599       1599       1599       1599       1599        1599        1599      
## ====================================================================================================================================================

I decided to try a linear model starting from quality ~ alcohol, because alcohol is the only feature with a moderate direct correlation with quality. I then updated this model by adding the remaining features one at a time.
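The iterative build-up of nested models can be sketched with `lm()` and `update()`. The data below is synthetic and deliberately small (the real `rwq` data frame is not reproduced here), so the coefficients are not comparable with the table above; `rwq_demo` and the simulated relationship are assumptions for illustration only.

```r
# Synthetic stand-in for the wine data: NOT the real measurements.
set.seed(42)
n <- 200
rwq_demo <- data.frame(
  alcohol          = runif(n, 8.4, 14.9),
  fixed.acidity    = runif(n, 4.6, 15.9),
  volatile.acidity = runif(n, 0.12, 1.58)
)
rwq_demo$quality <- 2 + 0.35 * rwq_demo$alcohol -
  1.2 * rwq_demo$volatile.acidity + rnorm(n, sd = 0.7)

# Start with one predictor, then add one feature per step.
m1 <- lm(quality ~ alcohol, data = rwq_demo)
m2 <- update(m1, ~ . + fixed.acidity)
m3 <- update(m2, ~ . + volatile.acidity)

# R-squared of each nested model; it never decreases as predictors are added.
r_squared <- sapply(list(m1 = m1, m2 = m2, m3 = m3),
                    function(m) summary(m)$r.squared)
round(r_squared, 3)

# F-tests comparing each model with the previous one.
anova(m1, m2, m3)
```

Because R-squared cannot go down when a predictor is added, adjusted R-squared, AIC or BIC are the fairer rows to compare across the eleven models.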

This model seems to have little predictive power: R-squared stays between 0.2 and 0.4 no matter how many features are added. This suggests that many features in this dataset are only weakly related to quality and are mostly correlated with each other.
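The table rounds R-squared to one decimal, which hides the differences between models. A finer value can be recovered from the Deviance row, since for a linear model the deviance is the residual sum of squares and R-squared = 1 - RSS / TSS. The sketch below rebuilds the total sum of squares from the quality counts reported earlier; the variable names are illustrative.

```r
# Quality scores rebuilt from the counts per level reported earlier.
counts  <- c("3" = 10, "4" = 53, "5" = 681, "6" = 638, "7" = 199, "8" = 18)
quality <- rep(as.integer(names(counts)), times = counts)

# Total sum of squares of quality around its mean.
tss <- sum((quality - mean(quality))^2)

# Residual sums of squares, copied from the Deviance row of the table.
rss <- c(m1 = 805.9, m2 = 781.2, m3 = 706.1, m4 = 703.8, m5 = 703.7,
         m6 = 703.6, m7 = 703.5, m8 = 698.0, m9 = 697.7, m10 = 693.4,
         m11 = 666.4)

# R-squared for each model, to three decimals.
r_squared <- 1 - rss / tss
round(r_squared, 3)

# Check for m1: in simple regression R-squared equals the squared correlation.
0.4761663^2
```

The largest steps come from adding volatile.acidity (m3) and sulphates (m11), which matches the two features with highly significant new coefficients in the table.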

Some of the variables are right-skewed and look better on a logarithmic scale, so later we can try to improve this model by transforming them with the log10 function.


Final Plots and Summary

Plot One

## Warning: Ignoring unknown parameters: binwidth, bins, pad

Description One

My first choice is the histogram of the quality distribution. The distribution of red wine quality is nearly normal, and most red wines fall in the medium quality group.

Plot Two

Description Two

My second choice is the boxplot of alcohol by quality. I think this plot shows the main conclusion of this report: wine quality tends to rise with the % of alcohol (higher quality wines have a higher % of alcohol).

Plot Three

Description Three

My third choice is the line plot that shows the relationship between 3 variables: volatile.acidity, citric.acid and quality. From this plot we can conclude that wines with higher volatile.acidity and lower citric.acid tend to have lower quality than wines with lower volatile.acidity and higher citric.acid.


Reflection

When I started to investigate this dataset I was sure that every chemical variable would have an impact on wine quality. But after I built the bivariate plots I was shocked that only one variable (alcohol) has a moderate correlation with quality.

The most interesting thing I discovered is that some of the other variables are correlated with each other. I was frustrated and decided to investigate them further, since together they might have a big impact on quality. First, I created line plots for the multivariate analysis and used the quality label as the line color. These plots were not easy to analyse, so I experimented with scatterplots and regression lines. This kind of plot helped me to see the relationships between volatile.acidity, citric.acid and quality, and between fixed acidity, density and quality.

I also tried to build a linear model to predict wine quality, but I didn’t get any important results from it.

I think that to understand the relationships in this dataset better I need to experiment with other combinations of variables and to build other predictive models, for example using the log10 function. These steps may reveal something more about this dataset.

This dataset consists of 1599 observations of only one kind of red wine, Portuguese “Vinho Verde”. If we add more observations of different kinds of red wine, we may see other correlations between variables.